home *** CD-ROM | disk | FTP | other *** search
- Path: comma.rhein.de!yaps!arno
- From: arno@yaps.rhein.de (Arno Eigenwillig)
- Newsgroups: comp.sys.amiga.programmer
- Subject: Re: realloc() bizzare behaviour...
- Message-ID: <MRm6y*XVf@yaps.rhein.de>
- Date: Sat, 17 Feb 1996 10:22:40 +0100
- References: <4fstmd$jco@bs33n.staffs.ac.uk>
- Organization: Yet Another Private Site in Meckenheim, Germany
- X-Copyright: This article may not be distributed on a CD-ROM
- or in printed form without prior written consent of the author.
- X-Newsreader: Arn V 1.04
-
- In article <4fstmd$jco@bs33n.staffs.ac.uk>, Wildfire writes:
-
- > I've done an LZW compressor.
-
- > struct string
- > {
- > unsigned int
- > length,
- > char
- > *string_body;
- > };
-
- > Got that? So first time I realloc(), I'm realloc()ing 257 of these
- > structures. Next time, 258, then 259, etc, etc.
-
- I recently wrote an LZW compressor as CS homework, too. I've investi-
- gated the possibilities to manage the LZW code table in an efficient
- way and completely abandoned the idea of one big array. I believe my
- method to be more efficient and therefore allow myself to be so harsh
- and describe it instead of hacking around with yours.
-
- First thing is, for compression you never need an LZW code's string in
- toto. If you investigate the compression's main loop, you will find
- out that it does not actually operate on a string S with a length of
- N, but it operates on a substring B (the first N-1 characters of S)
- and the trailing character C. Because B already has been assigned an
- LZW code, there is an entry in the code table for it, and you can
- take a pointer to it; or if B is only one character long, you can
- describe it by that character. Let us call the union which represents
- B either as pointer or as character by the name U.
-
- Now you no longer need to manage actual strings (which is unfriendly
- to do because of variable length and all) but only pairs (B/C).
-
- Second thing is, you need to manage the code table in a way that
- allows fast access. An array with the LZW code as index is just about
- the worst method imaginable, because you need to search linearily
- through it to find the LZW code for a given string. This will make
- your program O(N*N) or even slower.
-
- My method was to use an array of hash chains (just like the Amiga file
- system does, BTW). That is, you create an array of M list headers
- (singly linked lists suffice, so a "list header" actually is a pointer
- to the first element) and define a hash function returning values
- between 0 and (M-1).
-
- To put a string into the table, you compute its hash H, then put it
- into the H-th linked list. To search a string, you compute its hash H
- and linearily search through the H-th list. Technically speaking, this
- still gives you O(N*N) efficiency, but the Big-Oh-Notation kills
- benign coefficients: For 12 bit LZW compression, you cannot define
- more than 2^12 - 256 = 3840 LZW codes, so a few thousand hash chains
- result in nice short lists.
-
- Obviously, to combine both ideas in an efficient way, you need a hash
- function that can compute a string's hash by knowing the hash of its
- first N-1 characters and the last character. The usual
-
- hash = 0;
- for (i = 0; i < length; i++) {
- hash = (hash*C + s[i]) % M;
- }
-
- is apt for this. Note that M should be prime and C should be large
- enough to allow for even distribution of the hash values of two-
- character-strings.
-
- Alternatives to the hash chain array are binary search trees, flat
- arrays with rehashing, etc.
-
- Disclaimer: The above is the result of a one-day homework project.
- Mileage for real-world tours may vary.
-
- If you are willing to read TurboPascal *ugh* source code with German
- identifiers, I can provide an example implementation.
-
- -- __
- __/// Arno Eigenwillig /\ <arno@yaps.rhein.de> \/ PGP key available.
- \XX/ V+49-2225-5870 /\ <Arnooo @ #amigager> \/ MIME 8bit welcome.
- :-( Reduced net.presence until June 1996. )-:
-